Selection of Relevant Features in Machine Learning
Author: Pat Langley
Abstract
In this paper, we review the problem of selecting relevant features for use in machine learning. We describe this problem in terms of heuristic search through a space of feature sets, and we identify four dimensions along which approaches to the problem can vary. We consider recent work on feature selection in terms of this framework, then close with some challenges for future work in the area.

The selection of relevant features, and the elimination of irrelevant ones, is a central problem in machine learning. Before an induction algorithm can move beyond the training data to make predictions about novel test cases, it must decide which attributes to use in these predictions and which to ignore. Intuitively, one would like the learner to use only those attributes that are 'relevant' to the target concept. There have been a few attempts to define 'relevance' in the context of machine learning, as John, Kohavi, and Pfleger (1994) have noted in their review of this topic. Because we will review a variety of approaches, we do not take a position on this issue here. We will focus instead on the task of selecting relevant features (however defined) for use in learning and prediction.

Many induction methods attempt to deal directly with the problem of attribute selection, especially ones that operate on logical representations. For instance, techniques for inducing logical conjunctions do little more than add or remove features from the concept description. Addition and deletion of single attributes also constitute the basic operations of more sophisticated methods for inducing decision lists and decision trees. Some nonlogical induction methods, like those for neural networks and Bayesian classifiers, instead use weights to assign degrees of relevance to attributes. And some learning schemes, such as the simple nearest neighbor method, ignore the issue of relevance entirely. We would like induction algorithms that scale well to domains with many irrelevant features.
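The "heuristic search through a space of feature sets" framing can be made concrete with greedy forward selection, one simple instantiation of that search. The sketch below is illustrative only: `score` stands in for a hypothetical evaluation function (e.g., cross-validated accuracy of some induction algorithm restricted to the candidate subset), and the hill-climbing strategy is just one of the search policies the paper's framework covers.

```python
def forward_select(features, score):
    """Greedy forward selection: start from the empty feature set and,
    at each step, add the single feature that most improves `score`.
    Stops when no single addition helps. `score` maps a set of feature
    names to a real-valued figure of merit (higher is better)."""
    selected = set()
    best = score(selected)
    improved = True
    while improved:
        improved = False
        best_f = None
        # Try every feature not yet selected; keep the best improvement.
        for f in features - selected:
            s = score(selected | {f})
            if s > best:
                best, best_f, improved = s, f, True
        if improved:
            selected.add(best_f)
    return selected, best

# Toy usage with a made-up score: features 'a' and 'b' each help,
# 'c' slightly hurts, so the search settles on {'a', 'b'}.
features = {"a", "b", "c"}
score = lambda s: len(s & {"a", "b"}) - 0.1 * len(s & {"c"})
subset, value = forward_select(features, score)
```

Backward elimination is the mirror image (start with all features, greedily remove), and the two occupy different corners of the search-space dimensions the paper goes on to identify.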
More specifically, we would like the sample complexity (the number of training cases needed to reach a given level of accuracy) to grow slowly with the number of irrelevant attributes. Theoretical results for algorithms that search restricted hypothesis spaces are encouraging. For instance, the worst-case number of errors made by Littlestone's (1987) Winnow method grows only logarithmically with the number of irrelevant features. Pazzani and Sarrett's (1992) average-case analysis for Wholist, a simple conjunctive algorithm, and Langley and Iba's (1993) treatment of the naive Bayesian classifier, suggest …
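To give a feel for why Winnow degrades gracefully with irrelevant features, here is a minimal sketch of the mistake-driven multiplicative update it uses. The particular constants are assumptions for illustration (promotion/demotion factor alpha = 2, threshold n/2, division-based demotion in the style of Winnow2); weights on relevant features grow multiplicatively while irrelevant ones are driven down, which is the mechanism behind the logarithmic mistake bound.

```python
def winnow_train(examples, n, alpha=2.0):
    """Online Winnow-style learner for Boolean inputs.
    examples: list of (x, y) pairs, x a 0/1 list of length n, y in {0, 1}.
    Returns the learned weights and the number of mistakes made."""
    w = [1.0] * n          # all features start equally weighted
    theta = n / 2.0        # fixed decision threshold
    mistakes = 0
    for x, y in examples:
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0
        if pred != y:
            mistakes += 1
            if y == 1:
                # False negative: promote weights of active features.
                w = [wi * alpha if xi else wi for wi, xi in zip(w, x)]
            else:
                # False positive: demote weights of active features.
                w = [wi / alpha if xi else wi for wi, xi in zip(w, x)]
    return w, mistakes

# Toy run: the target concept depends only on feature 0; the other
# three features are irrelevant noise.
examples = [
    ([1, 0, 0, 0], 1),
    ([0, 1, 1, 0], 0),
    ([1, 0, 0, 0], 1),
    ([0, 0, 1, 1], 0),
    ([1, 1, 0, 0], 1),
]
w, mistakes = winnow_train(examples, n=4)
```

After these five examples the relevant feature's weight dominates the irrelevant ones, and further mistakes on the same distribution taper off, in line with the worst-case analysis cited above.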